{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IsYAkdZB7ZN9"
   },
   "source": [
    "![reranker.png]()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CnMJYir34Ntb"
   },
   "source": [
    "# Problem Statement:\n",
    "\n",
    "\n",
    "In a typical RAG pipeline, LLM Context window is limited so for a hypothetical 10000 pages document, we need to chunk the document. For any incoming user query, we need to fetch `Top-N` related chunks and because neither our Embedding are 100% accurate nor search algo is perfect, it could give us unrelated results too. This is a flaw in RAG pipeline. How can you deal with it? If you fetch Top-1 and the context is different then it's a sure bad answer. On the other hand, if you fetch more chunks and pass to LLM, it'll get confused and with higher number, it'll go out of context.\n",
    "\n",
    "# What's the remedy?\n",
    "\n",
    "Out of all the methods available, Re-ranking is the simplest. Idea is pretty simple.\n",
    "\n",
    "\n",
    "1. You assume that Embedding + Search algo are not 100% precise so you use Recall to your advantage and get similar high `N` (say 25) number of related chunks from corpus.\n",
    "\n",
    "2. Second step is to use a powerful model to increase the Precision. You re-rank above `N` queries again so that you can change the relative ordering and now select Top `K` queries (say 3) to pass as a context where `K` < `N` thus increasing the Precision.\n",
    "\n",
    "\n",
    "# Why can't you use the bigger model in the first place?\n",
    "Would your search results be better if you were searching in 100 vs 100000 documents? Yes, so no matter how big of a model you use, you'll always have some irrelevent results because of the huge domain.\n",
    "\n",
    "\n",
    "Smaller model with efficient searching algo does the work of searching in a bigger domain to get more number of elements while the larger model is precise and because it just works on `K`, there is a bit more overhead but improved relevancy.\n"
   ]
  },
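  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two-step idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: `cheap_score` and `precise_score` are hypothetical toy stand-ins for the embedding search and the re-ranker model, and `retrieve_then_rerank` is a name introduced only for this example.\n",
    "\n",
    "```python\n",
    "def cheap_score(query, doc):\n",
    "    # Toy stand-in for embedding similarity: share of query words found in doc.\n",
    "    q_words = set(query.lower().split())\n",
    "    d_words = set(doc.lower().split())\n",
    "    return len(q_words & d_words) / len(q_words) if q_words else 0.0\n",
    "\n",
    "\n",
    "def bigrams(text):\n",
    "    # Adjacent word pairs, so that word order matters.\n",
    "    words = text.lower().split()\n",
    "    return set(zip(words, words[1:]))\n",
    "\n",
    "\n",
    "def precise_score(query, doc):\n",
    "    # Toy stand-in for a re-ranker: also rewards word pairs in matching order.\n",
    "    return len(bigrams(query) & bigrams(doc)) + cheap_score(query, doc)\n",
    "\n",
    "\n",
    "def retrieve_then_rerank(query, corpus, n=25, k=3):\n",
    "    # Stage 1 (recall): score the whole corpus cheaply and keep the top n.\n",
    "    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n]\n",
    "    # Stage 2 (precision): rescore only those n candidates and keep the top k.\n",
    "    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:k]\n",
    "\n",
    "\n",
    "corpus = [\n",
    "    'neural networks for image search',\n",
    "    'search networks with neural image',\n",
    "    'cooking pasta at home',\n",
    "]\n",
    "query = 'image search with neural networks'\n",
    "print(retrieve_then_rerank(query, corpus, n=2, k=1))\n",
    "```\n",
    "\n",
    "In this toy corpus the cheap scorer alone ranks the shuffled sentence first, while the second stage promotes the chunk whose word order matches the query. The expensive scorer only ever sees `n` candidates, never the whole corpus."
   ]
  },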
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wgPbKbpumkhH"
   },
   "source": [
    "## Credentials\n",
    "\n",
    "Copy and paste the project name and the api key from your project page.\n",
    "These will be used later to [connect to LanceDB Cloud](#scroll-to=5q8m6GMD7sGu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "rqEXT5-fmofw"
   },
   "outputs": [],
   "source": [
    "project_slug = \"your-project-slug\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "5LYmBomPmswi"
   },
   "outputs": [],
   "source": [
    "api_key = \"sk_...\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Xs6tr6CMnBrr"
   },
   "source": [
    "You can also set the LANCEDB_API_KEY as an environment variable. More details can be found <a href=\"https://github.com/lancedb/vectordb-recipes/tree/main/examples/RAG_Reranking/lancedb_cloud/README.md\">**here**</a>."
   ]
  },
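  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to pick up the key from the environment is sketched below. `resolve_api_key` is a hypothetical helper introduced for this example, not part of LanceDB, and it assumes the `sk_...` placeholder from the cell above means that no real key was entered.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def resolve_api_key(explicit_key=None, env_var='LANCEDB_API_KEY'):\n",
    "    # Prefer a real key passed explicitly (the 'sk_...' placeholder does not count).\n",
    "    if explicit_key and explicit_key != 'sk_...':\n",
    "        return explicit_key\n",
    "    # Otherwise fall back to the environment variable.\n",
    "    key = os.environ.get(env_var)\n",
    "    if not key:\n",
    "        raise RuntimeError(f'Set {env_var} or pass an API key explicitly')\n",
    "    return key\n",
    "\n",
    "\n",
    "# Example (left commented out so this cell never raises on its own):\n",
    "# api_key = resolve_api_key(api_key)\n",
    "```\n",
    "\n",
    "Failing early with a clear message is a design choice here; it avoids a less obvious authentication error later when connecting."
   ]
  },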
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3rAS8qvulTzf"
   },
   "source": [
    "### Installing dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5LCzoheJKW8X",
    "outputId": "aafa482b-e347-49d1-b37f-3a560b48d754"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m87.4/87.4 kB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.4/8.4 MB\u001b[0m \u001b[31m18.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m507.1/507.1 kB\u001b[0m \u001b[31m30.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m43.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m803.6/803.6 kB\u001b[0m \u001b[31m40.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.6/18.6 MB\u001b[0m \u001b[31m57.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.3/38.3 MB\u001b[0m \u001b[31m19.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m15.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.7/7.7 MB\u001b[0m \u001b[31m70.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m270.9/270.9 kB\u001b[0m \u001b[31m28.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.0/86.0 kB\u001b[0m \u001b[31m11.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m64.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m457.9/457.9 kB\u001b[0m \u001b[31m36.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m274.7/274.7 kB\u001b[0m \u001b[31m29.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m981.5/981.5 kB\u001b[0m \u001b[31m64.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.4/3.4 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m49.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m229.5/229.5 kB\u001b[0m \u001b[31m13.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.3/49.3 kB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m98.7/98.7 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m78.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m270.7/270.7 kB\u001b[0m \u001b[31m21.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m265.7/265.7 kB\u001b[0m \u001b[31m25.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m20.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.0/261.0 kB\u001b[0m \u001b[31m22.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m258.1/258.1 kB\u001b[0m \u001b[31m18.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m251.2/251.2 kB\u001b[0m \u001b[31m21.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m49.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
      "\u001b[?25h  Building wheel for FlagEmbedding (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "  Building wheel for langdetect (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "  Building wheel for sentence_transformers (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
      "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
      "ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.\n",
      "tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.9.0 which is incompatible.\u001b[0m\u001b[31m\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "!pip install -U lancedb transformers datasets FlagEmbedding unstructured langchain -qq\n",
    "\n",
    "# NOTE: If there is an import error, restart and run the notebook again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0QQL4lm8lTzg"
   },
   "source": [
    "### Importing libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 508,
     "referenced_widgets": [
      "25222b6616e1489eb531950c958c5fdf",
      "6b6cd8dbe43440b29bf705ebc04fede7",
      "7b8c414f7aad49fe9f98760489d42ed1",
      "886c86257b4645a2a929eb39f20ab8a3",
      "f918256ef4874941a1ec098ea5050f6a",
      "187dee520d434a1eaab95c4b17723d35",
      "af680e36244e4f9691bb156d01c3b3b8",
      "413e5e69d5f246df9d530bcb797286d9",
      "ada52c81a40444eca27763305e25ef92",
      "3f5273f0ab8645368d73148784759cf1",
      "b72560a9b60348e1a2764f24a33188fa",
      "80d917f992794502aa6828ed7d01af98",
      "83c646c3a2f543949e2f02138e59e982",
      "89706a0fb9e34f97b3ff6db95e2e87b5",
      "109a9da70a6a4f6789fa397bf2a81fa4",
      "8d58def01f5d412589983675059926ac",
      "2fcb6704693b4842b7c4a224e2d916fd",
      "72c8de55a48d4fa8ae79b9289cbee1d4",
      "fb03d4b1113a42d1914557a26058d82f",
      "d6b75db679df4a849b362c77df481e30",
      "c2115e12648343a4b0ca23455c46f9a5",
      "a7c7e8bbdbbf44649af4e40be262a959",
      "db68c63dcd244ca9b8b391559f8abfd1",
      "158d4dd4e4f7495e9d2d6f360c29bf02",
      "16dd8588f2464d4281c0dde85cc28c6d",
      "ad2279ff1d4d47068d037ca698005140",
      "ca7870dc84ec48c681f6411595f321ad",
      "11e884ac25d64b0bba935a296b17d5d9",
      "eb6fa6ee9e74440bb4ce2a92ee4548c7",
      "9027c073a81a4360adbe90ed3bd9c099",
      "5fe367a2cb2b4f949b4b667d1d47e49c",
      "52e6dbcd90824ad096c1b610123df935",
      "62450a7fb0e540688ee9ad510a290609",
      "2f98cd9e8e5f4466a7a2cf88b087ce55",
      "5c07f16b1bb04a55bdcf0ae2476b78dd",
      "4e3741de8a1e4021a49e8abfb925c563",
      "0dfe271be6914892b327d306e669f4aa",
      "d88597e336ad4297a4aa6bd3d7fdc5dd",
      "1e011635dca34cae8ac44614bdfdf88c",
      "ba01a8212e0741c48d0d49095cfb5c17",
      "fd80d3b519124a759df834da4af06967",
      "5f419e6126ad421ea2efa5b73b38aef5",
      "0cb30ddb214540f8b74219a9fc77127b",
      "a6b63191503c43f691f28878fcd39b26",
      "4d174874270f4e0f88eb27a16aa0f11c",
      "2eed217c58f34c83b5b15bf2f955d9a0",
      "63a17bb1290a428da71bfe76a08e04db",
      "a35d2f0d849d474f82cbd3dc6879b12d",
      "f2f00b1f73954d95b95889fa1a34c5ae",
      "3fef4607ccd943008c272f66f9bf08b8",
      "0d7c92de0c384d72aafada73e685aa08",
      "c8ebd32170a44ab6bfedee79ea5509ec",
      "13e60c9491d542099f1a881330ee1c04",
      "1897e98d30ea4d3896c3c2a2b9b2c23e",
      "9019bbd2898447fc8c692163e223b4b1",
      "8a12ffea544146c29541b3b4a1c6db2b",
      "4a0c748f613a4a979f7eaee825878044",
      "c3be2b813b664e0cb9443c2aa0707afc",
      "9c594928a30a4b038c9779c23eaf9fb6",
      "aa7efcd26d21435a9b951a872d122c25",
      "bc9715c03c4d41f7b13cdc3fbca26b1b",
      "55efb72d3ac94e85b0259ade8318a84a",
      "8bd10565e96a46c2882d635a524593f2",
      "10d4116aace649bbae035c02e13828d8",
      "6ebe20cd50074cafad52f37818123cff",
      "e37f90735fd94698a4172e0292da7c9f",
      "9003eadfcead41aaaadabf18a706200f",
      "9e8a582cb44a4b618edc8d7844956a93",
      "22c4bb69842a43aa83fcba4eb6c1406c",
      "e191dbd0809c47c7ba28d3f6a0fcb1c5",
      "b7d5f50552744a528b99873318ee1bfc",
      "8b4cefc11b4a403eaafeafefdb0cd763",
      "077036d733a84d2881bd0f4d486277b4",
      "f22ae0abe3cb40a5b97559c7216400e7",
      "abc7b65193fb49f7899b10820992e163",
      "7410dd99fca940159ba8d13c9c52bae3",
      "6710ae95aea9445ab998afb5d0bb3241",
      "2be215a977184227abd5983e7c81b3ff",
      "6fb5d22a5b744b98907c7d36ad675e37",
      "4eca6411f44a4a638423a51289957959",
      "b99e200f694d4e2a83360752c6b5441c",
      "c61d80f6b40a4716b18ab7555fe604d1",
      "b7dfac1fc6f047d8a611ad711e18bbe7",
      "5913015323b349ac83f29dc3419ee468",
      "850b2174e35d49579060721becfe4287",
      "2bea46a2708d4e61910dc8138f99426b",
      "5a8b708edb414013a6915a9cbbe95f0a",
      "f451f6f10dfc45049f07c44e12b04836",
      "373fe974f094461d87f5b40ad6aa4e91",
      "f1dd6f744bf34a3aa1ecd72115f63155",
      "e5486dfd2ecd411eb60f3e3a89b64660",
      "5865b1aead9641d1bed54d9d166945f8",
      "41de8969ac714df5a4f4132f440af675",
      "5aae76de0c3b42fb81642990d8bbdf93",
      "cef20c5f499646e291329b580cf3800f",
      "90eac81b19a04354ad842f3fbe87e694",
      "cc666236572240f8b1015f187a2f66d9",
      "20f4ea917f4143cd9349fe3afa9c040d",
      "344dc2f380944602b0d5a712dad8473c",
      "2bfe8c958a9d44f781453b529255e01f",
      "e7a255788ec94998924142f4255ce409",
      "e409114ff67443ca92cb46ffb0697b58",
      "3e13281360324914921f135ba80e9672",
      "1d5908be44944e41a0b81875afb14411",
      "d30b92db93594fe8b2f83241bb498f78",
      "9f679d0c8c5e4bf59111200f955ae8d7",
      "944f4a261b5240408ab7fc473c7b0835",
      "32b556bd8ddc4ef196eba4d1fd6b6b62",
      "4c51bcbf31774cda86465a3ec707831d",
      "8bb2cf9274a84b84ac8d20d4d38aaecc",
      "3ab4015a66824bb3a2374d5a090e4e35",
      "63a499ae50bb4c12b4cf90dca53d0a07",
      "9d04d5b70be34666848a0347437cb7ea",
      "ab924099c7cc4d31a60afec68a0ff0d1",
      "92bbac1ac3da4331abc4c0afe7fffbd6",
      "bc01b443cb204014bcbcf7cb0fea4c86",
      "28a64ca9341341449b9e778d73db6321",
      "ba10edc885244040ab89498be10bd4db",
      "584e80ac477d4a01b23405a5fa29f092",
      "00a0f8d7b04c495b91b6decf446c50d5",
      "199779c3a63c4632be9c8fa65b7f33d8",
      "3e3303826e33485ca844fc82a1035b61",
      "c32fbe6f9b39480985e3bff59f6fcccd",
      "ad77277175bb456a9a6ce15af4aa5868",
      "a6b0d04284b748adb9e74530d25589e0",
      "2ae210dcb6bd47d584b980d478b254a2",
      "e95487a1fdac44fdb47b991d8ba87c3c",
      "dd3eef846edd4b5382b97ed6dce2c6d6",
      "498ff77afafa440cb0f6afb39627ec1d",
      "4015861324f64d4ab2b1a7a4153266ff",
      "7dd04c9ff2b34590ade55890b2f47b88",
      "c33d33427cc245eebe6e190684304904"
     ]
    },
    "id": "vP6d6JUShgqo",
    "outputId": "88a7acfb-e6c1-45e3-d94d-9c8517ba0e50"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n",
      "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
      "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
      "You will be able to reuse this secret in all of your notebooks.\n",
      "Please note that authentication is recommended but still optional to access public models or datasets.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "25222b6616e1489eb531950c958c5fdf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/396 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "80d917f992794502aa6828ed7d01af98",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "db68c63dcd244ca9b8b391559f8abfd1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2f98cd9e8e5f4466a7a2cf88b087ce55",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4d174874270f4e0f88eb27a16aa0f11c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/731 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8a12ffea544146c29541b3b4a1c6db2b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9003eadfcead41aaaadabf18a706200f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2be215a977184227abd5983e7c81b3ff",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "373fe974f094461d87f5b40ad6aa4e91",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2bfe8c958a9d44f781453b529255e01f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/279 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3ab4015a66824bb3a2374d5a090e4e35",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/799 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3e3303826e33485ca844fc82a1035b61",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# All document present here https://github.com/FlagOpen/FlagEmbedding/tree/master\n",
    "\n",
    "from FlagEmbedding import FlagAutoModel, FlagAutoReranker\n",
    "import os\n",
    "import lancedb\n",
    "import re\n",
    "import pandas as pd\n",
    "import random\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "import torch\n",
    "import gc\n",
    "\n",
    "from lancedb.embeddings import with_embeddings\n",
    "\n",
    "\n",
    "task = \"qa\"  # Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)\n",
    "\n",
    "# Load model (automatically use GPUs)\n",
    "embed_model = FlagAutoModel.from_finetuned(\"BAAI/bge-base-en\", use_fp16=False)\n",
    "\n",
    "# use_fp16 speeds up computation with a slight performance degradation\n",
    "reranker_model = FlagAutoReranker.from_finetuned(\n",
    "    \"BAAI/bge-reranker-base\", use_fp16=True\n",
    ")\n",
    "\n",
    "# For basic splitting\n",
    "# basic_text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64,) # 512 is the default Embedding model max_len\n",
    "\n",
    "# For Advanced Usage: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token\n",
    "# embedder_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/llm-embedder\") # Advanced Tokenizer Splitter Strategy\n",
    "# advanced_text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(embedder_tokenizer, chunk_size=512, chunk_overlap=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8eKRYd2F7v5n"
   },
   "source": [
    "# Load `Chunks` of data from [BeIR Dataset](https://huggingface.co/datasets/BeIR/scidocs)\n",
    "\n",
    "Note: This is a dataset built specially for retrieval tasks to see how good your search is working"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 271,
     "referenced_widgets": [
      "4f46a6443368428193090a6a4ddf3473",
      "b2c7ed0a7fcd45229194b5422b7801d3",
      "434d6bee7bfc4aa5bab6c2a080a193e5",
      "7e9a93af505a4b1a96c947363bbae0b9",
      "453f440ade9946a5b12bfe6cea948368",
      "c92cde2b50804c18ace89bc605d1d6d1",
      "1d616b7a849f4a83a0f1d8dd446b96f7",
      "a938e445c6674332b9d6253358d0e1e0",
      "01a1a33b691d427bb5cccce1f4b79693",
      "33ba91ddcbd74a9e870ed3c2ba9f86d0",
      "803e1962e31b4a918c28e9ff20732313",
      "8a78b64d3f6b4ffdae7e79266d798635",
      "bb9b8200e6be4e18a8cf38b14b03e4ce",
      "a83bd13e5c6e4cbb96c2c2f7acfe8423",
      "471ea20537984f18a07b3a198750c3e0",
      "f79797ffd6a649b1a0edae63eee91bea",
      "ea704757587d4b09af079e555d6f57d1",
      "6dcfc1e9c851445e95b862f30dfc8dee",
      "e582de00e2af4948b9f072653c787712",
      "2887f8a70d8a45d5b633ec2106865a45",
      "fc2403f083124228befe690caad6dd3d",
      "a60c75bb501b49e48d09dd50cb645bdd",
      "a3eb172e9d324da1bd7d8914e66d2106",
      "166beb1aa15b4927a9c27fda4a8d6de1",
      "f2667dbbb4c5462986c9cad904767540",
      "7fd4ee74216249ae806b5d4045da9523",
      "b5fd41d1dba0476491cb311bd4d47741",
      "83474dc942a44919a4e48ee36b65f8f6",
      "77002ce5084c44b8b06987bee947f099",
      "dd9ca10fac4447bfaa7bd665a88e1033",
      "27b6a73af53d4cf1946ae2ece8c499e2",
      "bbbc8e741a0b44ef835e10fee58bbadf",
      "1fbd0891d5a24a54ae54656b0d8a6247",
      "501f90bcf1ff4efe81cd377df249415e",
      "aa29504327084ed5816d90ca3f9e9f16",
      "71c123550e2a4166955eef2f142170fb",
      "15b01052fc6140e2be8fae7c2d2928fa",
      "8badd98fa463404aa80f604b45f4a912",
      "407cf90293e44890903a4d89ee08008a",
      "18df3037afae470e8ac9d297f93fd9ce",
      "9c0796729bb0455d92a4f418e86fa38a",
      "5baf674c98a846e1a79fda9c8ee77e78",
      "11a0fb71fd0e486982215656adcd2bdc",
      "5be1ec5880cb459fb7a88ae7c1f2394f"
     ]
    },
    "id": "l0ezDr7suAf_",
    "outputId": "dc78db6c-480e-4c69-dade-561785bd7c85"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4f46a6443368428193090a6a4ddf3473",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data:   0%|          | 0.00/93.3k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8a78b64d3f6b4ffdae7e79266d798635",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating queries split:   0%|          | 0/1000 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a3eb172e9d324da1bd7d8914e66d2106",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading data:   0%|          | 0.00/19.0M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "501f90bcf1ff4efe81cd377df249415e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating corpus split:   0%|          | 0/25657 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-c09c3758-6ef0-4828-a243-4f71a60db0b3\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13181</th>\n",
       "      <td>6df3dc585e32f3b1cb49228d94a5469c30d79d2b</td>\n",
       "      <td>High Performance Computer Acoustic Data Accele...</td>\n",
       "      <td>This paper presents a new software model desig...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18678</th>\n",
       "      <td>784376563c94e231241fbcf71d4d2774aec4b935</td>\n",
       "      <td>A Comparison over Focused Web Crawling Strategies</td>\n",
       "      <td>In this paper we review and compare focused cr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4444</th>\n",
       "      <td>19751e0f81a103658bbac2506f5d5c8e06a1c06a</td>\n",
       "      <td>STDP-based spiking deep convolutional neural n...</td>\n",
       "      <td>Previous studies have shown that spike-timing-...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c09c3758-6ef0-4828-a243-4f71a60db0b3')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-c09c3758-6ef0-4828-a243-4f71a60db0b3 button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-c09c3758-6ef0-4828-a243-4f71a60db0b3');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-e6cf670f-c03b-4203-8362-0c1cda30496b\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-e6cf670f-c03b-4203-8362-0c1cda30496b')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-e6cf670f-c03b-4203-8362-0c1cda30496b button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                            _id  \\\n",
       "13181  6df3dc585e32f3b1cb49228d94a5469c30d79d2b   \n",
       "18678  784376563c94e231241fbcf71d4d2774aec4b935   \n",
       "4444   19751e0f81a103658bbac2506f5d5c8e06a1c06a   \n",
       "\n",
       "                                                   title  \\\n",
       "13181  High Performance Computer Acoustic Data Accele...   \n",
       "18678  A Comparison over Focused Web Crawling Strategies   \n",
       "4444   STDP-based spiking deep convolutional neural n...   \n",
       "\n",
       "                                                    text  \n",
       "13181  This paper presents a new software model desig...  \n",
       "18678  In this paper we review and compare focused cr...  \n",
       "4444   Previous studies have shown that spike-timing-...  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "queries = load_dataset(\"BeIR/scidocs\", \"queries\")[\"queries\"].to_pandas()\n",
    "docs = (\n",
    "    load_dataset(\"BeIR/scidocs\", \"corpus\")[\"corpus\"]\n",
    "    .to_pandas()\n",
    "    .dropna(subset=\"text\")\n",
    "    .sample(10000)\n",
    ")  # just random samples for faster embed demo\n",
    "docs.sample(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HJf8xZmX8VJC"
   },
   "source": [
    "# Get embedding using [`LLM embedder`](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder) and create Database using [`LanceDB Cloud`](https://lancedb.github.io/lancedb/cloud/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 49,
     "referenced_widgets": [
      "f6b5cd5ff9704a58986eff2c9c88db4c",
      "691424012c59434f8cc17f3d6aa001f3",
      "e2c52e10ab294bc68bcb1caadaf8d0c7",
      "ffba63bef7e944b7923b2b29f9495527",
      "5cd148642750417abd38cbf483ccf1f9",
      "8cad72468680488aa62c33186cedf084",
      "21b3538c53cd4ff5895a817782884101",
      "b1140f9f312441a8a32b8c7a9461baac",
      "4c955a42756d47eea9a00a87c4b5f0f0",
      "b8f872d00274483c951804adeef7c500",
      "cc951613d15d49bbb24409d46b06c1b6"
     ]
    },
    "id": "5aljyqpUiViE",
    "outputId": "20c89d94-dec1-4997-dbb9-c0d7bbad34ab"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f6b5cd5ff9704a58986eff2c9c88db4c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/79 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def embed_documents(batch):\n",
    "    \"\"\"\n",
    "    Function to embed the whole text data\n",
    "    \"\"\"\n",
    "    return embed_model.encode_keys(batch, task=task)  # Encode data or 'keys'\n",
    "\n",
    "\n",
    "uri = \"db://\" + project_slug\n",
    "db = lancedb.connect(uri, api_key=api_key, region=\"us-east-1\")\n",
    "table_name = \"doc_embed\"\n",
    "try:\n",
    "    # Use the train text chunk data to save embed in the DB\n",
    "    data = with_embeddings(\n",
    "        embed_documents, docs, column=\"text\", show_progress=True, batch_size=128\n",
    "    )\n",
    "    table = db.create_table(table_name, data=data)  # create Table\n",
    "except:\n",
    "    table = db.open_table(table_name)  # Open Table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PCufm9Xr8eWp"
   },
   "source": [
    "# Search from a random Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 381
    },
    "id": "964Z2sZA247g",
    "outputId": "929efa44-e243-47ec-8921-4c4c4a8b5ece"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "QUERY:->  Classification of human activity by using a Stacked Autoencoder\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-3ad5739c-f8fb-495c-8af8-4f761bd4c4d5\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "      <th>vector</th>\n",
       "      <th>_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>83d323a5bb26b706d4f6d24eb27411a7e7ff57e6</td>\n",
       "      <td>Protective action of green tea catechins in ne...</td>\n",
       "      <td>Mitochondria are central players in the regula...</td>\n",
       "      <td>[-0.014866754, 0.0028244434, -0.023141732, 0.0...</td>\n",
       "      <td>0.281554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a3345798b1faf238e8d805bbe9124b0b8e0c869f</td>\n",
       "      <td>Autophagy as a regulated pathway of cellular d...</td>\n",
       "      <td>Macroautophagy is a dynamic process involving ...</td>\n",
       "      <td>[-0.042504933, 0.00053501845, -0.016986104, 0....</td>\n",
       "      <td>0.312909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>e0534bfb477c5a82e98d0cb386ae3eb31d349c91</td>\n",
       "      <td>Cellular and molecular mechanisms of hepatocel...</td>\n",
       "      <td>Hepatocellular carcinoma (HCC) is the most com...</td>\n",
       "      <td>[0.03984485, 0.01583628, -0.00934351, -0.02993...</td>\n",
       "      <td>0.366526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>c65945c08b7fd77ffd2c53369e8928699c3993e7</td>\n",
       "      <td>Comparing Alzheimer’s and Parkinson’s diseases...</td>\n",
       "      <td>Recent advances in large datasets analysis off...</td>\n",
       "      <td>[-0.004613025, -0.0044279257, -0.013920496, 0....</td>\n",
       "      <td>0.369777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1af2e075903a3cc5ad5a192921a0b4fb67645dc1</td>\n",
       "      <td>Mathematical models of cancer metabolism.</td>\n",
       "      <td>Metabolism is essential for life, and its alte...</td>\n",
       "      <td>[-0.0037386382, 0.011562068, -0.022479024, 0.0...</td>\n",
       "      <td>0.370503</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>3979cf5a013063e98ad0caf2e7110c2686cf1640</td>\n",
       "      <td>Basic local alignment search tool.</td>\n",
       "      <td>A new approach to rapid sequence comparison, b...</td>\n",
       "      <td>[-0.006935188, 0.020925103, -0.051218845, 0.00...</td>\n",
       "      <td>0.372769</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0fb926cae217b70c97c74eb70b2a6b8c47574812</td>\n",
       "      <td>Network biology: understanding the cell's func...</td>\n",
       "      <td>A key aim of postgenomic biomedical research i...</td>\n",
       "      <td>[0.012990677, 0.028128441, -0.006426807, -0.02...</td>\n",
       "      <td>0.376812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>488257dcbc7bcb56836f10a410e69c2c283989e5</td>\n",
       "      <td>mTOR Signaling in Growth Control and Disease</td>\n",
       "      <td>The mechanistic target of rapamycin (mTOR) sig...</td>\n",
       "      <td>[0.0006567143, 0.0053487234, -0.0010087299, -0...</td>\n",
       "      <td>0.376821</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>65f415c6d88aca139867702fc64aa179781b8e65</td>\n",
       "      <td>PID: the Pathway Interaction Database</td>\n",
       "      <td>The Pathway Interaction Database (PID, http://...</td>\n",
       "      <td>[-0.007852315, 0.014019204, -0.026789214, -0.0...</td>\n",
       "      <td>0.378377</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>244fc78ce607812edb90290727dab4d33377e986</td>\n",
       "      <td>Transfer of mitochondria via tunneling nanotub...</td>\n",
       "      <td>Tunneling nanotubes (TNTs) are F-actin-based m...</td>\n",
       "      <td>[-0.0063375738, 0.006348416, -0.034239322, 0.0...</td>\n",
       "      <td>0.380112</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-3ad5739c-f8fb-495c-8af8-4f761bd4c4d5')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-3ad5739c-f8fb-495c-8af8-4f761bd4c4d5 button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-3ad5739c-f8fb-495c-8af8-4f761bd4c4d5');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-8708ac61-e986-4d7b-8b33-0e905137e75f\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-8708ac61-e986-4d7b-8b33-0e905137e75f')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-8708ac61-e986-4d7b-8b33-0e905137e75f button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                        _id  \\\n",
       "0  83d323a5bb26b706d4f6d24eb27411a7e7ff57e6   \n",
       "1  a3345798b1faf238e8d805bbe9124b0b8e0c869f   \n",
       "2  e0534bfb477c5a82e98d0cb386ae3eb31d349c91   \n",
       "3  c65945c08b7fd77ffd2c53369e8928699c3993e7   \n",
       "4  1af2e075903a3cc5ad5a192921a0b4fb67645dc1   \n",
       "5  3979cf5a013063e98ad0caf2e7110c2686cf1640   \n",
       "6  0fb926cae217b70c97c74eb70b2a6b8c47574812   \n",
       "7  488257dcbc7bcb56836f10a410e69c2c283989e5   \n",
       "8  65f415c6d88aca139867702fc64aa179781b8e65   \n",
       "9  244fc78ce607812edb90290727dab4d33377e986   \n",
       "\n",
       "                                               title  \\\n",
       "0  Protective action of green tea catechins in ne...   \n",
       "1  Autophagy as a regulated pathway of cellular d...   \n",
       "2  Cellular and molecular mechanisms of hepatocel...   \n",
       "3  Comparing Alzheimer’s and Parkinson’s diseases...   \n",
       "4          Mathematical models of cancer metabolism.   \n",
       "5                 Basic local alignment search tool.   \n",
       "6  Network biology: understanding the cell's func...   \n",
       "7       mTOR Signaling in Growth Control and Disease   \n",
       "8              PID: the Pathway Interaction Database   \n",
       "9  Transfer of mitochondria via tunneling nanotub...   \n",
       "\n",
       "                                                text  \\\n",
       "0  Mitochondria are central players in the regula...   \n",
       "1  Macroautophagy is a dynamic process involving ...   \n",
       "2  Hepatocellular carcinoma (HCC) is the most com...   \n",
       "3  Recent advances in large datasets analysis off...   \n",
       "4  Metabolism is essential for life, and its alte...   \n",
       "5  A new approach to rapid sequence comparison, b...   \n",
       "6  A key aim of postgenomic biomedical research i...   \n",
       "7  The mechanistic target of rapamycin (mTOR) sig...   \n",
       "8  The Pathway Interaction Database (PID, http://...   \n",
       "9  Tunneling nanotubes (TNTs) are F-actin-based m...   \n",
       "\n",
       "                                              vector  _distance  \n",
       "0  [-0.014866754, 0.0028244434, -0.023141732, 0.0...   0.281554  \n",
       "1  [-0.042504933, 0.00053501845, -0.016986104, 0....   0.312909  \n",
       "2  [0.03984485, 0.01583628, -0.00934351, -0.02993...   0.366526  \n",
       "3  [-0.004613025, -0.0044279257, -0.013920496, 0....   0.369777  \n",
       "4  [-0.0037386382, 0.011562068, -0.022479024, 0.0...   0.370503  \n",
       "5  [-0.006935188, 0.020925103, -0.051218845, 0.00...   0.372769  \n",
       "6  [0.012990677, 0.028128441, -0.006426807, -0.02...   0.376812  \n",
       "7  [0.0006567143, 0.0053487234, -0.0010087299, -0...   0.376821  \n",
       "8  [-0.007852315, 0.014019204, -0.026789214, -0.0...   0.378377  \n",
       "9  [-0.0063375738, 0.006348416, -0.034239322, 0.0...   0.380112  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def search(query, top_k=10):\n",
    "    \"\"\"\n",
    "    Search a query from the table\n",
    "    \"\"\"\n",
    "    query_vector = embed_model.encode_queries(\n",
    "        query\n",
    "    )  # Encode the QUERY (it is done differently than the 'key')\n",
    "    search_results = table.search(query_vector).limit(top_k)\n",
    "    return search_results\n",
    "\n",
    "\n",
    "query = random.choice(queries[\"text\"])\n",
    "print(\"QUERY:-> \", query)\n",
    "\n",
    "# get top_k search results\n",
    "search_results = (\n",
    "    search(\"what is mitochondria?\", top_k=10)\n",
    "    .to_pandas()\n",
    "    .dropna(subset=\"text\")\n",
    "    .reset_index(drop=True)\n",
    ")\n",
    "\n",
    "search_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cntAuaUU_TER"
   },
   "source": [
    "# Rerank Search Results using Reranker from [`BGE Reranker`](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)\n",
    "\n",
    "Pass all the results to a stronger model to give them the similarity ranking"
   ]
  },
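  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal, self-contained sketch of the reranking step, not the notebook's actual reranker. It assumes a hypothetical `score_fn` standing in for the stronger model (here a toy `word_overlap` scorer) and made-up candidate texts; the `old_similarity_rank` and `new_scores` names mirror the columns of the reranked table.\n",
    "\n",
    "```python\n",
    "def rerank(query, candidates, score_fn, top_k=3):\n",
    "    '''Re-score first-stage candidates and keep the top_k best.\n",
    "\n",
    "    candidates: texts already ordered by the first-stage search (best first).\n",
    "    score_fn:   hypothetical stand-in for a stronger model; higher = more relevant.\n",
    "    '''\n",
    "    rows = [\n",
    "        {'text': text, 'old_similarity_rank': rank, 'new_scores': score_fn(query, text)}\n",
    "        for rank, text in enumerate(candidates, start=1)\n",
    "    ]\n",
    "    # Sort by the new score, highest first, then cut down to top_k.\n",
    "    rows.sort(key=lambda row: row['new_scores'], reverse=True)\n",
    "    return rows[:top_k]\n",
    "\n",
    "\n",
    "def word_overlap(query, text):\n",
    "    '''Toy scorer: fraction of query words that appear in the text.'''\n",
    "    query_words = set(query.lower().split())\n",
    "    text_words = set(text.lower().split())\n",
    "    return len(query_words & text_words) / len(query_words)\n",
    "\n",
    "\n",
    "# Made-up candidates, listed in the order a first-stage search might return them.\n",
    "candidates = [\n",
    "    'green tea catechins and neurons',\n",
    "    'mitochondria transfer between cells',\n",
    "    'what mitochondria is and what it does',\n",
    "]\n",
    "top = rerank('what is mitochondria', candidates, word_overlap, top_k=2)\n",
    "print([row['old_similarity_rank'] for row in top])  # [3, 2]\n",
    "```\n",
    "\n",
    "Swapping in a real reranker should only change `score_fn`; the sort-and-cut logic stays the same."
   ]
  },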
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 554
    },
    "id": "dHw2DSAj3u9B",
    "outputId": "a089995b-c652-4075-dc47-6a851f9026ea"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "QUERY:->  Classification of human activity by using a Stacked Autoencoder\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-202a8f8a-8d1b-46d5-985c-f1fc42580a4e\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_id</th>\n",
       "      <th>title</th>\n",
       "      <th>text</th>\n",
       "      <th>vector</th>\n",
       "      <th>_distance</th>\n",
       "      <th>old_similarity_rank</th>\n",
       "      <th>new_scores</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a3345798b1faf238e8d805bbe9124b0b8e0c869f</td>\n",
       "      <td>Autophagy as a regulated pathway of cellular d...</td>\n",
       "      <td>Macroautophagy is a dynamic process involving ...</td>\n",
       "      <td>[-0.042504933, 0.00053501845, -0.016986104, 0....</td>\n",
       "      <td>0.312909</td>\n",
       "      <td>2</td>\n",
       "      <td>-3.949219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3979cf5a013063e98ad0caf2e7110c2686cf1640</td>\n",
       "      <td>Basic local alignment search tool.</td>\n",
       "      <td>A new approach to rapid sequence comparison, b...</td>\n",
       "      <td>[-0.006935188, 0.020925103, -0.051218845, 0.00...</td>\n",
       "      <td>0.372769</td>\n",
       "      <td>6</td>\n",
       "      <td>-5.410156</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>83d323a5bb26b706d4f6d24eb27411a7e7ff57e6</td>\n",
       "      <td>Protective action of green tea catechins in ne...</td>\n",
       "      <td>Mitochondria are central players in the regula...</td>\n",
       "      <td>[-0.014866754, 0.0028244434, -0.023141732, 0.0...</td>\n",
       "      <td>0.281554</td>\n",
       "      <td>1</td>\n",
       "      <td>-6.652344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>65f415c6d88aca139867702fc64aa179781b8e65</td>\n",
       "      <td>PID: the Pathway Interaction Database</td>\n",
       "      <td>The Pathway Interaction Database (PID, http://...</td>\n",
       "      <td>[-0.007852315, 0.014019204, -0.026789214, -0.0...</td>\n",
       "      <td>0.378377</td>\n",
       "      <td>9</td>\n",
       "      <td>-7.402344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0fb926cae217b70c97c74eb70b2a6b8c47574812</td>\n",
       "      <td>Network biology: understanding the cell's func...</td>\n",
       "      <td>A key aim of postgenomic biomedical research i...</td>\n",
       "      <td>[0.012990677, 0.028128441, -0.006426807, -0.02...</td>\n",
       "      <td>0.376812</td>\n",
       "      <td>7</td>\n",
       "      <td>-7.824219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1af2e075903a3cc5ad5a192921a0b4fb67645dc1</td>\n",
       "      <td>Mathematical models of cancer metabolism.</td>\n",
       "      <td>Metabolism is essential for life, and its alte...</td>\n",
       "      <td>[-0.0037386382, 0.011562068, -0.022479024, 0.0...</td>\n",
       "      <td>0.370503</td>\n",
       "      <td>5</td>\n",
       "      <td>-8.070312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>c65945c08b7fd77ffd2c53369e8928699c3993e7</td>\n",
       "      <td>Comparing Alzheimer’s and Parkinson’s diseases...</td>\n",
       "      <td>Recent advances in large datasets analysis off...</td>\n",
       "      <td>[-0.004613025, -0.0044279257, -0.013920496, 0....</td>\n",
       "      <td>0.369777</td>\n",
       "      <td>4</td>\n",
       "      <td>-9.007812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>488257dcbc7bcb56836f10a410e69c2c283989e5</td>\n",
       "      <td>mTOR Signaling in Growth Control and Disease</td>\n",
       "      <td>The mechanistic target of rapamycin (mTOR) sig...</td>\n",
       "      <td>[0.0006567143, 0.0053487234, -0.0010087299, -0...</td>\n",
       "      <td>0.376821</td>\n",
       "      <td>8</td>\n",
       "      <td>-9.507812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>244fc78ce607812edb90290727dab4d33377e986</td>\n",
       "      <td>Transfer of mitochondria via tunneling nanotub...</td>\n",
       "      <td>Tunneling nanotubes (TNTs) are F-actin-based m...</td>\n",
       "      <td>[-0.0063375738, 0.006348416, -0.034239322, 0.0...</td>\n",
       "      <td>0.380112</td>\n",
       "      <td>10</td>\n",
       "      <td>-9.593750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>e0534bfb477c5a82e98d0cb386ae3eb31d349c91</td>\n",
       "      <td>Cellular and molecular mechanisms of hepatocel...</td>\n",
       "      <td>Hepatocellular carcinoma (HCC) is the most com...</td>\n",
       "      <td>[0.03984485, 0.01583628, -0.00934351, -0.02993...</td>\n",
       "      <td>0.366526</td>\n",
       "      <td>3</td>\n",
       "      <td>-10.195312</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-202a8f8a-8d1b-46d5-985c-f1fc42580a4e')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-202a8f8a-8d1b-46d5-985c-f1fc42580a4e button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-202a8f8a-8d1b-46d5-985c-f1fc42580a4e');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-0e93b529-df8e-4e12-8e1b-b07a39857d18\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-0e93b529-df8e-4e12-8e1b-b07a39857d18')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-0e93b529-df8e-4e12-8e1b-b07a39857d18 button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                        _id  \\\n",
       "0  a3345798b1faf238e8d805bbe9124b0b8e0c869f   \n",
       "1  3979cf5a013063e98ad0caf2e7110c2686cf1640   \n",
       "2  83d323a5bb26b706d4f6d24eb27411a7e7ff57e6   \n",
       "3  65f415c6d88aca139867702fc64aa179781b8e65   \n",
       "4  0fb926cae217b70c97c74eb70b2a6b8c47574812   \n",
       "5  1af2e075903a3cc5ad5a192921a0b4fb67645dc1   \n",
       "6  c65945c08b7fd77ffd2c53369e8928699c3993e7   \n",
       "7  488257dcbc7bcb56836f10a410e69c2c283989e5   \n",
       "8  244fc78ce607812edb90290727dab4d33377e986   \n",
       "9  e0534bfb477c5a82e98d0cb386ae3eb31d349c91   \n",
       "\n",
       "                                               title  \\\n",
       "0  Autophagy as a regulated pathway of cellular d...   \n",
       "1                 Basic local alignment search tool.   \n",
       "2  Protective action of green tea catechins in ne...   \n",
       "3              PID: the Pathway Interaction Database   \n",
       "4  Network biology: understanding the cell's func...   \n",
       "5          Mathematical models of cancer metabolism.   \n",
       "6  Comparing Alzheimer’s and Parkinson’s diseases...   \n",
       "7       mTOR Signaling in Growth Control and Disease   \n",
       "8  Transfer of mitochondria via tunneling nanotub...   \n",
       "9  Cellular and molecular mechanisms of hepatocel...   \n",
       "\n",
       "                                                text  \\\n",
       "0  Macroautophagy is a dynamic process involving ...   \n",
       "1  A new approach to rapid sequence comparison, b...   \n",
       "2  Mitochondria are central players in the regula...   \n",
       "3  The Pathway Interaction Database (PID, http://...   \n",
       "4  A key aim of postgenomic biomedical research i...   \n",
       "5  Metabolism is essential for life, and its alte...   \n",
       "6  Recent advances in large datasets analysis off...   \n",
       "7  The mechanistic target of rapamycin (mTOR) sig...   \n",
       "8  Tunneling nanotubes (TNTs) are F-actin-based m...   \n",
       "9  Hepatocellular carcinoma (HCC) is the most com...   \n",
       "\n",
       "                                              vector  _distance  \\\n",
       "0  [-0.042504933, 0.00053501845, -0.016986104, 0....   0.312909   \n",
       "1  [-0.006935188, 0.020925103, -0.051218845, 0.00...   0.372769   \n",
       "2  [-0.014866754, 0.0028244434, -0.023141732, 0.0...   0.281554   \n",
       "3  [-0.007852315, 0.014019204, -0.026789214, -0.0...   0.378377   \n",
       "4  [0.012990677, 0.028128441, -0.006426807, -0.02...   0.376812   \n",
       "5  [-0.0037386382, 0.011562068, -0.022479024, 0.0...   0.370503   \n",
       "6  [-0.004613025, -0.0044279257, -0.013920496, 0....   0.369777   \n",
       "7  [0.0006567143, 0.0053487234, -0.0010087299, -0...   0.376821   \n",
       "8  [-0.0063375738, 0.006348416, -0.034239322, 0.0...   0.380112   \n",
       "9  [0.03984485, 0.01583628, -0.00934351, -0.02993...   0.366526   \n",
       "\n",
       "   old_similarity_rank  new_scores  \n",
       "0                    2   -3.949219  \n",
       "1                    6   -5.410156  \n",
       "2                    1   -6.652344  \n",
       "3                    9   -7.402344  \n",
       "4                    7   -7.824219  \n",
       "5                    5   -8.070312  \n",
       "6                    4   -9.007812  \n",
       "7                    8   -9.507812  \n",
       "8                   10   -9.593750  \n",
       "9                    3  -10.195312  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
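   ],
   "source": [
    "def rerank(query, search_results):\n",
    "    \"\"\"Re-score retrieved chunks against the query and sort by the new scores.\"\"\"\n",
    "    # Work on a copy so the caller's DataFrame is not modified in place\n",
    "    search_results = search_results.copy()\n",
    "    # 1-based rank from the initial vector search (assumes a default 0..N-1 index)\n",
    "    search_results[\"old_similarity_rank\"] = search_results.index + 1\n",
    "\n",
    "    # Drop unreachable Python objects first, then release cached GPU memory\n",
    "    gc.collect()\n",
    "    torch.cuda.empty_cache()\n",
    "\n",
    "    # Score every [query, chunk] pair with the reranker model\n",
    "    search_results[\"new_scores\"] = reranker_model.compute_score(\n",
    "        [[query, chunk] for chunk in search_results[\"text\"]]\n",
    "    )\n",
    "    # Highest reranker score first; the old rank column shows how the order changed\n",
    "    return search_results.sort_values(by=\"new_scores\", ascending=False).reset_index(\n",
    "        drop=True\n",
    "    )\n",
    "\n",
    "\n",
    "print(\"QUERY:-> \", query)\n",
    "\n",
    "rerank(query, search_results)"
   ]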
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
